
[serve][llm][transcription] Add support for Transcription in vLLM engine backend #57194

Merged
kouroshHakha merged 41 commits into ray-project:master from Blaze-DSP:master
Oct 24, 2025

Conversation

@Blaze-DSP
Contributor

@Blaze-DSP Blaze-DSP commented Oct 4, 2025

Why are these changes needed?

Expose a transcriptions API, like https://platform.openai.com/docs/api-reference/audio, using vLLM.
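For context, here is a minimal client sketch of what the new endpoint enables, using the OpenAI Python SDK against a Ray Serve deployment; the base URL, API key, and whisper-tiny model id are placeholders rather than anything defined in this PR:

from openai import OpenAI

# Point the client at the Ray Serve OpenAI-compatible router; the base URL,
# API key, and model id below are placeholder values for this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-tiny",  # hypothetical model id served by the deployment
        file=audio_file,
    )

print(transcription.text)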

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Blaze-DSP Blaze-DSP requested review from a team as code owners October 4, 2025 20:49
@Blaze-DSP Blaze-DSP requested a review from a team as a code owner October 4, 2025 20:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new transcription API, following the OpenAI specification. The changes are well-structured, touching the necessary model definitions, LLM server, vLLM engine, and router components. The implementation largely follows existing patterns in the codebase. However, I've identified a couple of critical issues that would cause runtime errors, such as a missing comma in a type hint and a method name mismatch between the server and the engine. There are also some minor maintainability issues like a copy-pasted comment and a typo in a docstring. Addressing these points will make the PR ready for merging.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces support for a transcription API to vLLM's OpenAI-compatible interface, following the OpenAI audio/transcriptions API specification. The implementation adds the necessary request/response models, router endpoints, and engine integration to handle audio transcription requests.

  • Adds TranscriptionRequest, TranscriptionResponse, and TranscriptionStreamResponse models
  • Implements /v1/audio/transcriptions endpoint in the router (see the request sketch after this list)
  • Integrates transcription support into the vLLM engine with proper error handling
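As a rough illustration of the endpoint shape, a hedged sketch of the raw multipart request the router accepts per the OpenAI audio/transcriptions spec; the URL and model id are placeholders:

import requests

# Multipart POST against the new endpoint; URL and model id are placeholder values.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "whisper-tiny"},
    )

print(resp.status_code, resp.json())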

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:
python/ray/serve/llm/openai_api_models.py: Adds public API models for transcription request/response types
python/ray/llm/_internal/serve/deployments/routers/router.py: Implements transcription endpoint and updates request processing logic
python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py: Adds transcription engine integration with vLLM OpenAI serving
python/ray/llm/_internal/serve/deployments/llm/llm_server.py: Adds transcription method to LLM server with async generator interface
python/ray/llm/_internal/serve/configs/openai_api_models.py: Defines internal transcription models and response type unions

Contributor

@kouroshHakha kouroshHakha left a comment

Nice. I think the basic feature looks good. We need to just add CI tests and some release tests as well.

For CI please take a look at existing tests for the endpoints at engine and router levels. Here are some I found:

You would need to create a mock engine with some reasonable transcription behavior.

Let's keep the translation for another PR after we cover everything for this new endpoint.

For release test, could you share the serve run script that you used to validate the behavior along with the client code and expected output. We can turn that into a gpu release test with a real model (maybe using whisper-tiny, etc) so that it is continuously tested.
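For concreteness, a minimal sketch of the kind of serve script being requested, assuming the ray.serve.llm LLMConfig/build_openai_app APIs used in this PR's docs example and a whisper-tiny model; the script the author actually ran is not shown in this thread:

# serve_transcription.py -- sketch only; model ids and config values are illustrative.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

whisper_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "whisper-tiny",
        "model_source": "openai/whisper-tiny",
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 1},
    },
)

app = build_openai_app({"llm_configs": [whisper_llm_config]})
serve.run(app, blocking=True)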

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue), docs (An issue or change related to documentation), llm, and community-contribution (Contributed by the community) labels Oct 5, 2025
@Blaze-DSP Blaze-DSP requested a review from a team as a code owner October 7, 2025 20:32
@Blaze-DSP
Contributor Author

@kouroshHakha CI tests have been written and the docs have also been updated. Please check and verify.

If we are going to adopt vllm==0.11.0, then the v0 engine has been entirely deprecated and all models are supported via the v1 engine. We'll need to make the appropriate changes to the docs, etc. (e.g., for embeddings).

Contributor

@kouroshHakha kouroshHakha left a comment

Adding @eicherseiji for review.

Contributor

@eicherseiji eicherseiji left a comment

Recommend pip install pre-commit && pre-commit install before a lint commit to satisfy the CI.

For a release test, recommend def test_llm_serve_correctness( and - name: llm_serve_correctness for examples.

Looks like we're in pretty good shape though. Just a few comments + release test and we should be good. Thanks!

@cursor

cursor bot commented Oct 23, 2025

Bug: Batched Response Type Handling Issue

The condition isinstance(first_chunk, NON_STREAMING_RESPONSE_TYPES) checks whether first_chunk is an instance of a tuple of types, and NON_STREAMING_RESPONSE_TYPES is defined as a tuple containing ChatCompletionResponse, CompletionResponse, and TranscriptionResponse. However, when batching is enabled, first_chunk could be extracted from a list (lines 538-541), and the isinstance check should handle both the direct response object and the case where it's wrapped in a list. The current logic may incorrectly identify a non-streaming response when the first item in a batched list happens to be one of these types, even though the overall response is streaming.
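A minimal sketch of the unwrapping the bot is describing, written as a standalone helper with hypothetical names (the actual router variables are first_chunk and NON_STREAMING_RESPONSE_TYPES):

def is_non_streaming_chunk(first_chunk, non_streaming_types) -> bool:
    # If batching wrapped the chunk in a list, inspect the first element;
    # otherwise check the chunk itself against the tuple of response types.
    candidate = first_chunk[0] if isinstance(first_chunk, list) and first_chunk else first_chunk
    return isinstance(candidate, non_streaming_types)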


@cursor

cursor bot commented Oct 23, 2025

Bug: File Stream Exhaustion in Transcription Method

The transcriptions method consumes the request.file stream when reading audio data. This prevents subsequent operations, such as retries or logging, from accessing the file content.
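A minimal sketch of one way to avoid the exhaustion, assuming a FastAPI/Starlette UploadFile as the comment implies: read the stream once, keep the bytes, and rewind it for any later consumer.

from fastapi import UploadFile


async def read_audio_once(file: UploadFile) -> bytes:
    # Read the upload a single time, then rewind so retries, logging, or the
    # engine can still read the same file object afterwards.
    audio_data = await file.read()
    await file.seek(0)
    return audio_data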


@cursor

cursor bot commented Oct 23, 2025

Bug: Streaming Response Type Inconsistency

The type annotations for LLMChatResponse, LLMCompletionsResponse, and LLMTranscriptionResponse are inconsistent regarding streaming responses. They currently include both str (for SSE format) and specific *StreamResponse objects, which creates ambiguity about the actual type yielded during streaming.
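To make the split concrete, a hedged sketch of narrower aliases; the response classes here are stand-ins so the snippet is self-contained, not the PR's actual definitions:

from typing import AsyncGenerator, Union


class TranscriptionResponse: ...  # stand-in for the PR's model


class ErrorResponse: ...  # stand-in for the PR's model


# SSE-formatted strings appear only in the streaming alias; full response objects
# appear only in the non-streaming alias, removing the ambiguity described above.
LLMTranscriptionStreamResponse = AsyncGenerator[str, None]
LLMTranscriptionResponse = AsyncGenerator[Union[TranscriptionResponse, ErrorResponse], None]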


@cursor

cursor bot commented Oct 23, 2025

Bug: request_id Field Added to TranscriptionRequest

The request_id field is added to TranscriptionRequest with a default factory that generates a random UUID. However, based on the PR discussion comment "@kouroshHakha: do we need request_id here?", this field may not be needed. More critically, looking at the context files, CompletionRequest and EmbeddingCompletionRequest have similar request_id fields with TODO comments indicating they should be upstreamed to vLLM. The issue is that TranscriptionRequest inherits from vLLMTranscriptionRequest, and if vLLM's base class doesn't have this field, adding it here could cause serialization/deserialization issues when the request is passed to vLLM's engine, similar to the Pydantic ValidatorIterator issue mentioned in the code comments. The field should either be removed or the same TODO comment should be added indicating it needs to be upstreamed.
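For reference, a self-contained sketch of the field plus the TODO marker being asked for; the base class and description wording are stand-ins, not the PR's exact code:

from uuid import uuid4

from pydantic import BaseModel, Field


class TranscriptionRequest(BaseModel):  # stand-in base class for this sketch
    # TODO (as suggested above): upstream this field to vLLM's TranscriptionRequest,
    # mirroring CompletionRequest / EmbeddingCompletionRequest.
    request_id: str = Field(
        default_factory=lambda: uuid4().hex,
        description=(
            "The request_id related to this request. If not set, a random UUID is "
            "generated and used throughout the inference process."
        ),
    )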


@kouroshHakha
Contributor

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new transcription API, which is a fantastic addition to the LLM serving capabilities. The implementation is well-structured, with changes spanning from the public API models to the vLLM engine, and includes comprehensive tests. The refactoring in ingress.py to generalize request processing is a notable improvement for maintainability. I have a few suggestions regarding potential memory usage with large audio files, improving the clarity of the documentation example, and fixing a broken link in a TODO comment. Overall, this is a solid contribution.

Comment on lines 59 to 80
voxtral_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "voxtral-mini",
        "model_source": "mistralai/Voxtral-Mini-3B-2507",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
    },
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [whisper_llm_config, voxtral_llm_config]})
Contributor

medium

This example, which is focused on transcriptions, also includes the configuration for a second, unrelated model (voxtral-mini). This could be confusing for users who are looking for a minimal, focused example on how to set up a transcription service.

For better clarity and to make the example easier to copy and adapt, I recommend removing the voxtral_llm_config and updating the app creation on line 80 to only use the whisper_llm_config, like this:

app = build_openai_app({"llm_configs": [whisper_llm_config]})

This will make the example more direct and easier to follow for the specific use case of transcriptions.

Contributor

I agree. Should we just keep voxtral_llm_config, which also shows the engine_kwargs?

raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()
Contributor

medium

Reading the entire audio file into memory with await request.file.read() could lead to high memory consumption, especially with large audio files (the OpenAI API limit is 25MB) and concurrent requests. This might risk Out-Of-Memory errors on the replica.

If the underlying create_transcription method in vLLM supports it, consider streaming the file content instead of reading it all at once. This could be done by passing a file-like object or an async generator. If vLLM requires the full byte string, this is an acceptable limitation, but it's an important performance consideration to be aware of.

Contributor

@kouroshHakha kouroshHakha left a comment

Nice. Looks great. Just a few nits before wrapping up this PR:

"async-timeout; python_version < '3.11'",
"typer",
"meson",
"pybind11",
Contributor

@eicherseiji a chore we should do is make this part of setup.py read llm-requirements.txt directly, so we only update one source of truth down the line.
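A minimal sketch of what that single source of truth could look like; the path comes from this thread, and the filtering rules are an assumption:

def read_llm_requirements(path: str = "python/requirements/llm/llm-requirements.txt"):
    # Parse the requirements file so setup.py and the requirements file cannot drift apart.
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]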

Comment on lines 59 to 80
voxtral_llm_config = LLMConfig(
    model_loading_config={
        "model_id": "voxtral-mini",
        "model_source": "mistralai/Voxtral-Mini-3B-2507",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tokenizer_mode": "mistral",
        "config_format": "mistral",
        "load_format": "mistral",
    },
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [whisper_llm_config, voxtral_llm_config]})
Contributor

I agree. Should we just keep voxtral_llm_config, which also shows the engine_kwargs?

"""
pass

@abc.abstractmethod
Contributor

I think we should make this method non-abstract so subclasses can skip implementing it, and raise a NotImplementedError here. This point actually applies to all the endpoints, but it just came to my mind now :)
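A hedged sketch of that shape, with a simplified stand-in for the engine base class:

class LLMEngineBase:  # simplified stand-in; the real base class lives in this PR
    async def transcriptions(self, request):
        # Concrete default instead of @abc.abstractmethod, so engines that do not
        # serve transcriptions can simply skip overriding this method.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement the transcriptions endpoint."
        )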

@kouroshHakha
Contributor

making sure release tests pass: https://buildkite.com/ray-project/release/builds/65185

@kouroshHakha kouroshHakha changed the title from Transcription API to [serve][llm][transcription] Add support for Transcription in vLLM engine backend Oct 23, 2025
Blaze-DSP and others added 2 commits October 24, 2025 00:20
Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
@Blaze-DSP
Contributor Author

@kouroshHakha I've made the appropriate changes from the review.

@eicherseiji
Contributor

Requested stamp from @richardliaw

@kouroshHakha
Contributor

@Blaze-DSP the GPU test on the example is failing now (probably due to changing to a different model?). Can you take a look? It has a GPU memory problem.

Premerge Build 52450 LLM GPU Tests.log

DPatel_7 and others added 2 commits October 24, 2025 12:22
Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
AsyncGenerator[
Union[str, ChatCompletionStreamResponse, ChatCompletionResponse, ErrorResponse],
None,
],

Bug: Type Mismatch in Streaming Responses

The LLM*Response type annotations (e.g., LLMChatResponse, LLMCompletionsResponse, LLMTranscriptionResponse) include stream response object types. However, streaming endpoints ultimately yield SSE-formatted strings, not these objects. This creates a mismatch between the declared types and the actual runtime output, which can lead to type checking issues and confusion.


"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)

Bug: Request ID Conflict in Transcription Classes

The TranscriptionRequest class explicitly adds a request_id field. This could conflict with vLLMTranscriptionRequest if the base class already defines it, potentially causing duplication or unexpected behavior. This also differs from other request types that inherit request_id from their vLLM base classes.


set(
[
"vllm>=0.11.0",
"vllm[audio]>=0.11.0",

Bug: Unnecessary Audio Dependencies in Ray[LLM]

The ray[llm] extra now unconditionally pulls in vllm[audio], adding audio-related dependencies. This increases the dependency footprint for all ray[llm] users, even those not needing transcription features, making the installation heavier than necessary.


"async-timeout; python_version < '3.11'",
"typer",
"meson",
"pybind11",

Bug: Setup.py and Requirements.txt Mismatch

The setup.py adds "meson" and "pybind11" to the llm extras, but these dependencies are not present in the corresponding llm-requirements.txt file (lines 14-15 in the requirements file). While the comment on line 372 states "Keep this in sync with python/requirements/llm/llm-requirements.txt", these two packages are only added to setup.py and not to the requirements file, creating an inconsistency between the two dependency specifications.


raw_request = self._create_raw_request(request, "/audio/transcriptions")

# Extract audio data from the request file
audio_data = await request.file.read()

Bug: Audio File Reading and Pointer Issue

The code calls await request.file.read() to extract audio data from the request file. However, this reads the entire file into memory, which could be problematic for large audio files. Additionally, after reading the file once, the file pointer is at the end, so if vLLM's create_transcription method tries to read from request.file again, it will get empty data. The audio_data is extracted but the original request.file is still passed to create_transcription, which may cause issues if vLLM expects an unread file object.


Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)

Bug: UUID Inconsistency Causes vLLM Compatibility Issues

The TranscriptionRequest adds a request_id with a default UUID, which is inconsistent with other request types that mark this field for upstreaming to vLLM. This approach, questioned in a prior discussion, may cause compatibility issues with vLLM's backend.


Union[CompletionStreamResponse, CompletionResponse, ErrorResponse], None
Union[str, CompletionStreamResponse, CompletionResponse, ErrorResponse], None
],
]

Bug: Type Mismatch in Streaming Responses

The LLMChatResponse and LLMCompletionsResponse type annotations include *StreamResponse objects, but the engine's streaming responses actually yield raw strings. This creates a mismatch between the declared types and the actual return values.


# vLLM implementation for handling transcription requests: https://github.com/vllm-project/vllm/blob/0825197bee8dea547f2ab25f48afd8aea0cd2578/vllm/entrypoints/openai/api_server.py#L839.
async def transcriptions(
self, body: Annotated[TranscriptionRequest, Form()]
) -> Response:

Bug: Audio File Serialization Issue in Ray Serve

The transcriptions endpoint uses Annotated[TranscriptionRequest, Form()] to handle form data, but the TranscriptionRequest object contains a file field that is an UploadFile. When this is passed through Ray Serve's remote call mechanism (in _get_response), the UploadFile object may not be serializable for pickling, which is required for Ray remote calls. This could cause serialization errors when the request is sent to the model deployment. The audio data should be extracted and passed as bytes before the remote call, similar to what's done in the vLLM engine at line 473.
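A hedged sketch of the direction the bot suggests, assuming a Pydantic v2 request model; the payload field name and helper are illustrative, not the PR's actual code:

async def to_picklable_payload(body) -> dict:
    # Materialize the upload as bytes on the router side so only picklable data
    # crosses the Ray remote-call boundary.
    audio_bytes = await body.file.read()
    payload = body.model_dump(exclude={"file"})  # drop the non-picklable UploadFile
    payload["audio_bytes"] = audio_bytes  # hypothetical field consumed by the deployment
    return payload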


@Blaze-DSP
Contributor Author

My bad, fixed the issue @kouroshHakha.

@eicherseiji
Contributor

Release test llm_batch_vllm_multi_node is jailed on master, so its failure can be ignored: https://buildkite.com/ray-project/release/builds/65341#019a15f2-6b2e-474b-9ded-e4ed4d9fe246

All other release tests passing

@kouroshHakha kouroshHakha merged commit ca1f7d9 into ray-project:master Oct 24, 2025
5 of 6 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ine backend (ray-project#57194)

Signed-off-by: DPatel_7 <dpatel@gocommotion.com>
Co-authored-by: DPatel_7 <dpatel@gocommotion.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>